Automatic Sanskrit Segmentizer Using Finite State Transducers

نویسنده

  • Vipul Mittal
چکیده

In this paper, we propose a novel method for automatic segmentation of a Sanskrit string into different words. The input for our segmentizer is a Sanskrit string either encoded as a Unicode string or as a Roman transliterated string and the output is a set of possible splits with weights associated with each of them. We followed two different approaches to segment a Sanskrit text using sandhi1 rules extracted from a parallel corpus of manually sandhi split text. While the first approach augments the finite state transducer used to analyze Sanskrit morphology and traverse it to segment a word, the second approach generates all possible segmentations and validates each constituent using a morph analyzer.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Avyaya Analyzer: Analysis of Indeclinables using Finite State Transducers

In Sanskrit, avyaya i.e., the indeclinable, plays an important role in the construction of a sentence and can be used as preposition, interjection, particle, conjunction or an adverb. To understand a given sentence, a thorough cognizance of the avyaya is necessary. This paper briefly describes the avyayas, types of avyayas and role of avyayas. An avyaya analysis tool (avyaya Morphological Analy...

متن کامل

Applying the OCRopus OCR System to Scholarly Sanskrit Literature

OCRopus is an open source OCR system currently being developed, intended to be omni-lingual and omni-script. In addition to modern digital library applications, applications of the system include capturing and recognizing classical literature, as well as the large body of research literature about classics. OCRopus advances the state of the art in a number of ways, including the ability easily ...

متن کامل

Bilingual Clustering Using Monolingual Algorithms

The use of bilingual word classes greatly reduces the amount of data needed for training subsequential transducers, a finite state model adequate for small to medium translation tasks. We present an automatic approach to derive these classes using traditional monolingual word clustering methods.

متن کامل

Lexicon-directed Segmentation and Tagging of Sanskrit

We propose a methodology for Sanskrit processing by computer. The first layer of this software, which analyses the linear structure of a Sanskrit sentence as a set of possible interpretations under sandhi analysis, is operational. Each interpretation proposes a segmentation of the sentence as a list of tagged segments. The method, which is lexicon directed, is complete if the given (stem forms)...

متن کامل

Automatic Extraction of Hypernyms and Hyponyms from Russian Texts

The paper describes a rule-based approach for hypernym and hyponym extraction from Russian texts. For this task we employ finite state transducers (FSTs). We developed 6 finite state transducers that encode 6 lexicosyntactic patterns, which show a good precision on Russian DBpedia: 79.5% of the matched contexts are correct.

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010